ISOLATED-WORD SPEECH RECOGNITION USING MULTI-SECTION
VECTOR QUANTIZATION CODE BOOKS

I.  INTRODUCTION 

Vector Quantization (VQ) is a data compression principle [1] with several
successful applications, including speech coding [2, 3, 4], image coding [5, 6],
and speech recognition [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. In previous
work on speech recognition [8, 9, 16], we developed a method in which isolated
words are classified by means of the average distortion that results from
encoding them with VQ code books. In this paper, we present a generalization of
that method. The generalization, which improves recognition performance and
reduces computational requirements, was motivated by work of Martinez, Riviera,
and Buzo [10].

In our previous approach [16], a VQ code book is generated for each word in
the recognition vocabulary by applying an information-theoretic, iterative
clustering technique [18] to a training sequence containing several repetitions
of the vocabulary word. This clustering process removes all time-sequence
information from the training sequence and represents each vocabulary word as a
set of independent spectra. An input utterance is classified by encoding it with
every code book and finding the code book that yields the smallest average
distortion. Because the average distortion does not depend on the sequence of
input speech frames, this approach performs isolated-word recognition entirely
without time-alignment.

With  just  four  spectra  in  each  code  book,  our  previous  approach  achieved 
97.7%  accuracy  for  speaker-dependent  recognition  of  a  twenty-word  vocabulary 
[16].  With  eight  spectra  in  each  code  book,  the  accuracy  increased  to  98.8% 
[18].  These  results  showed  that  much  more  can  be  done  without  time-sequence 
information  than  is  commonly  assumed.  For  suitably  chosen  vocabularies, 
characteristic spectra contain enough information for recognition, and
information-theoretic  clustering  does  a  good  job  of  extracting  that  information 
from  training  data. 

To improve recognition performance and to decrease computational complexity,
we have been investigating ways of incorporating time-sequence information into
the recognition procedure. Here, we present results for a new method that
incorporates time-sequence information by means of sequences of VQ code books
that we refer to collectively as multi-section code books. A separate
multi-section code book is designed for each word in the recognition vocabulary
by dividing the words in the code book's training sequence into equal-length
sections and designing a standard VQ code book for each section. Unknown words
are classified by dividing them into appropriate sections, performing VQ on a
section-by-section basis, and finding the multi-section code book that yields
the smallest average distortion. The new approach reduces to our previous
approach when the number of sections is reduced to one. Henceforth, we refer to
our previous approach as the single-section case. Preliminary results for the
multi-section approach were reported in [12, 1?].

VQ  has  also  been  used  by  others  to  reduce  the  computational  and  memory 
requirements of existing isolated-word recognition approaches [7, 11, 13, 14, 15].
In  these  approaches,  spectra  from  a  single,  large  VQ  code  book  are  used  to 
replace  the  spectra  of  both  input  speech  frames  and  stored  reference  data.  Our 
approach  is  quite  different,  both  because  we  design  separate  code  books  for 
each  word  in  the  recognition  vocabulary,  and  because  we  avoid  standard 
methods  of  time  alignment. 

After explaining our speech recognition approach in Section II, we describe
the data base and experiments in Section III. Section IV contains the results for
speaker-independent recognition, and Section V contains results for
speaker-dependent recognition. We discuss computational considerations in
Section VI, and we present some general conclusions in Section VII.

II.  APPROACH 

In  this  section,  we  give  background  information  and  describe  the  multi- 
section  approach.  We  begin  by  describing  VQ  and  explaining  its  role  in  our 
isolated-word  recognition  approach.  We  then  discuss  distortion  measures,  linear 
prediction  parameters,  and  figures  of  merit. 

A. Vector Quantization

VQ is an information-theoretic data compression principle introduced by
Shannon in the late 1950's [19]. For a specified transmission rate, VQ's
objective is to find the set of reproduction vectors, or code book, that
represents an information source with minimum expected "distortion". The data
compression is achieved by transmitting a reproduction vector index rather than
the original source vector. In general, the selection of a perceptually
meaningful distortion measure and the construction of an optimal code book are
difficult problems. For speech, however, good choices exist [2, 3].

Speech coding by VQ is a narrow-bandwidth speech coding technique based
on linear predictive coding (LPC) [2, 3]. Using estimates of the sample
autocorrelation function that are measured in each frame, the shape of the
speech spectrum in each frame is encoded as the index of a prestored set of LPC
parameters that define an autoregressive model and is called a codeword. The
LPC parameters used are the inverse filter gain squared $\sigma^2$ and the
linear predictive coefficients $a_i$, $i = 1, \ldots, M$, with $a_0 = 1$. The
collection of possible codewords is called a code book. Let
$C = \{C_1, C_2, \ldots, C_N\}$ be a code book of $N$ codewords $C_i$, each
defining an autoregressive model and comprising a set of LPC parameters. Let
$S_j$ be the autocorrelation estimates from the $j$th frame of the speech to be
coded. Then the spectrum shape of the $j$th frame is coded by identifying the
codeword $\hat{C}_j$ that "best represents" $S_j$ according to the
"nearest-neighbor rule"

$$d(S_j, \hat{C}_j) = \min_{1 \le i \le N} d(S_j, C_i), \tag{1}$$

for some distortion measure $d$.

Vector quantization code books are designed to minimize the average
distortion that results from encoding a long training sequence of speech frames.
In particular, if $T_j$, $j = 1, \ldots, L$, is such a training sequence, the
code book $C$ is designed so that

$$\frac{1}{L} \sum_{j=1}^{L} \min_{1 \le i \le N} d(T_j, C_i) \tag{2}$$

achieves at least a local minimum. If the training sequence consists of typical
speech and it is represented with a small average distortion by the code book,
then $C$ should encode new speech with a similarly small distortion. In practice,
code books are designed by an iterative clustering technique. The algorithm
used here is based on the work in [10, 2]. Put simply, the $L$ frames of the
training sequence are divided into $N$ clusters such that all frames in the same
cluster have similar spectrum shapes. The $N$ codewords are the centroids of
these clusters.
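
The clustering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a squared-error distortion and arithmetic-mean centroids as stand-ins for the spectral distortion measures and LPC centroids discussed later, and the function names (`nearest`, `design_codebook`) and the naive initialisation from the first frames are hypothetical choices, not the design algorithm of the cited references.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_error(x: Vector, y: Vector) -> float:
    """Stand-in distortion d(x, y): squared Euclidean distance (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest(frame: Vector, codebook: List[Vector]) -> Tuple[int, float]:
    """Nearest-neighbor rule: index and distortion of the best codeword."""
    dists = [sq_error(frame, c) for c in codebook]
    best = min(range(len(codebook)), key=dists.__getitem__)
    return best, dists[best]


def design_codebook(training: List[Vector], size: int, iters: int = 20):
    """Iteratively cluster the training frames; codewords are cluster centroids."""
    codebook = [list(v) for v in training[:size]]  # naive initial guess
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for frame in training:
            idx, _ = nearest(frame, codebook)
            clusters[idx].append(frame)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                dim = len(members[0])
                codebook[i] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    # Average distortion of the training sequence, as in expression (2).
    avg = sum(nearest(f, codebook)[1] for f in training) / len(training)
    return codebook, avg
```

Each pass reassigns frames to their nearest codeword and then replaces every codeword by the centroid of its cluster, so the average training distortion cannot increase from one pass to the next for this distortion measure.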

B. VQ Word Recognition

In speech coding by VQ, a single code book is designed from a long training
sequence that is representative of all speech to be encoded by the system. In
the single-section approach to isolated word recognition [8, 9, 16], we used a
separate code book for each word in the recognition vocabulary. We designed
each code book from a training sequence containing repetitions of one
vocabulary word. For example, a code book for the word "seven" would be
designed by running the vector quantizer design algorithm on a training
sequence of several repetitions of the word "seven". To classify an unknown
word, we first encode it using each of the code books and record the average
distortion for each code book. The unknown word is then classified according
to the code book yielding the lowest average distortion.

Our new method, based on [10], represents each vocabulary word as a
time-dependent sequence of section code books, which we call a multi-section
code book. New words are classified by performing VQ and finding the
multi-section code book that achieves the smallest average distortion.

To be more precise, let $V$ be the number of words in the recognition
vocabulary, and let $T_k$ be the number of utterances in the training sequence
used to design code book $C_k$ for the $k$th vocabulary word, where
$k = 1, \ldots, V$. Also, let $F_{qk}$ be the number of frames in the $q$th
utterance in the training sequence for $C_k$, where $q = 1, \ldots, T_k$, and
finally, let $m = 1, \ldots, F_{qk}$ index the frames of the $q$th training
utterance for $C_k$. Then there are $V$ multi-section code books $C_k$, each
comprising a sequence of VQ section code books $C_{kj}$. The section code book
$C_{kj}$ is designed using $n$ frames from each training utterance for the
$k$th vocabulary word. That is, $C_{kj}$ is designed from the frames with
$m = (j-1)n + 1, \ldots, jn$ and $q = 1, \ldots, T_k$. In particular, $C_{k1}$
is designed from the first $n$ frames of each training utterance for the $k$th
word in the recognition vocabulary, $C_{k2}$ from the second $n$ frames, etc.
We call $n$ the compression factor; it is the number of frames that are spanned
per section. If, for a particular training utterance $q$, $m$ is greater than
$F_{qk}$, the corresponding frames lie beyond the end of the word and are not
included in the training sequence for $C_{kj}$. Finally, let $C_{kji}$,
$i = 1, \ldots, N_{kj}$, be codewords in section code book $C_{kj}$. We call
the $V$ multi-section code books $\{C_k;\ k = 1, \ldots, V\}$ a code book set.
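
As an illustration of how the training frames are pooled by section, the following Python sketch groups the frames of several training utterances of one word into per-section training sets. The function name `section_training_sets` is hypothetical, and taking the number of sections from the longest training utterance is an assumption made for this sketch.

```python
from typing import List, Sequence

Frame = Sequence[float]


def section_training_sets(utterances: List[List[Frame]], n: int) -> List[List[Frame]]:
    """Pool frames (j-1)n+1 .. jn of every utterance into training set j.

    Frames past the end of a short utterance are simply absent, matching the
    rule that they are not included in the training sequence.
    """
    longest = max(len(u) for u in utterances)
    num_sections = -(-longest // n)  # ceiling division (assumed section count)
    sections: List[List[Frame]] = [[] for _ in range(num_sections)]
    for utt in utterances:
        for m, frame in enumerate(utt):  # m is zero-based here
            sections[m // n].append(frame)
    return sections
```

Each returned list would then be handed to an ordinary VQ design procedure to produce one section code book.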

Suppose a new utterance to be classified contains $L$ frames, and $P_l$ is the
set of autocorrelation estimates from the $l$th frame ($l = 1, \ldots, L$). Now
let $D_k$ be the average distortion resulting from coding the unknown utterance
with the code book $C_k$,


where $S_k$ is the number of section code books in $C_k$, and

$$d_{kj} = \sum_{l=(j-1)n+1}^{\min[jn,\,L]} \; \min_{1 \le i \le N_{kj}} d(P_l, C_{kji}) \tag{4}$$

is the total distortion from coding the $j$th section of the input with the
$j$th section code book $C_{kj}$ of $C_k$, and where $n$ is the compression
factor. Then the utterance is classified as the $r$th word in the recognition
vocabulary, where

$$D_r = \min_{1 \le k \le V} D_k. \tag{5}$$

If desired, one can select a set of threshold values $D_{\min}$ and require
$D_r < D_{\min}$ in (3) for a valid classification. This can improve
classification reliability.
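
The classification rule can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the names `average_distortion` and `classify` are hypothetical, the distortion function `dist` is supplied by the caller, and dividing the summed section distortions by the number of frames actually encoded is an assumption about how the average in (3) is normalised.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]
SectionBook = List[Frame]             # codewords of one section code book
MultiSectionBook = List[SectionBook]  # one section code book per section
Distortion = Callable[[Frame, Frame], float]


def average_distortion(frames: List[Frame], book: MultiSectionBook,
                       n: int, dist: Distortion) -> float:
    """Average per-frame distortion of an utterance against one multi-section book.

    Frame l (zero-based) is encoded with section code book l // n. Encoding stops
    when either the input frames or the sections run out. The divisor (frames
    actually encoded) is an assumed normalisation.
    """
    usable = min(len(frames), n * len(book))
    total = 0.0
    for l in range(usable):
        section = book[l // n]
        total += min(dist(frames[l], c) for c in section)
    return total / usable


def classify(frames: List[Frame], book_set: List[MultiSectionBook], n: int,
             dist: Distortion, d_max: Optional[float] = None
             ) -> Tuple[Optional[int], List[float]]:
    """Return (r, scores): r indexes the book with the smallest average distortion.

    If d_max is given and the best score is not below it, r is None (rejection).
    """
    scores = [average_distortion(frames, b, n, dist) for b in book_set]
    r = min(range(len(scores)), key=scores.__getitem__)
    if d_max is not None and scores[r] >= d_max:
        return None, scores
    return r, scores
```

With a single section per book and `n` at least as large as the utterance length, the same code performs single-section classification.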

If, in the above description, all words are aligned at their beginnings, we call
the approach left-aligned. In the left-aligned case, variations in speaking rates
often result in several sounds being included in the training sequences for
individual section code books. To reduce this effect, we also tried linearly
normalizing all training sequence and classification utterances to the same
length. We call this approach length-normalized.

In the length-normalized approach, the number of sections in the input
word is always equal to the number of section code books. In the left-aligned
approach, however, the input word can have more or fewer sections than the code
books; we stop encoding a word in a code book when we run out of either input
word frames or code book sections.


In the foregoing terms, the approach in [10] corresponds to left-alignment
with $n = 1$. For left-alignment with $n$ greater than or equal to the maximum
number of frames in all the training utterances, the multi-section approach
reduces to our previous single-section approach [8, 9, 16].

C. Multi-Section Code Books

Each  classification  code  book  Ck  is  designed  from  a  separate  training 
sequence containing repetitions of the $k$th word in the recognition vocabulary. A
speaker-dependent  code  book  is  made  from  a  training  sequence  spoken  by  one 
person.  The  resulting  code  books  are  then  used  to  classify  additional  utterances 
from  that  speaker.  For  speaker-independent  code  books,  the  training  sequence 
for  each  code  book  is  spoken  by  several  people  and  the  code  books  are  used  to 
classify  additional  utterances  from  different  people. 

We  used  three  types  of  multi-section  code  books: 

(a)  fixed-size  code  books; 

(b)  fixed-distortion  code  books; 

(c)  unclustered  code  books. 

The  three  code  book  types  are  further  discussed  below. 

As the name implies, in a fixed-size code book the section code book size
$N_{kj}$ is specified ahead of time and the design algorithm chooses $N_{kj}$
codewords that minimize the average distortion resulting from encoding the
training sequence for a particular section code book. Section code book sizes
are limited for convenience to powers of 2, i.e., $N_{kj} = 2^{r_{kj}}$, where
$r_{kj}$ is called the rate of $C_{kj}$. All section code books (and thus
multi-section code books) in a fixed-size code book set have the same number of
codewords.

For a fixed-distortion code book, the design algorithm increases the section
code book size until it can design a section code book that encodes the training
sequence with an average distortion that is less than or equal to a pre-specified
value T. All section code books in a fixed-distortion code book set are generated
with the same average distortion threshold and can therefore have different
sizes. Like fixed-size section code books, the sizes of fixed-distortion section
code books are limited to powers of 2.
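
The fixed-distortion rule amounts to a simple doubling loop, sketched below in Python. The callback `design(training, size)` is hypothetical: it stands for any routine that returns a code book of the requested size together with its average distortion on the training sequence. The cap `max_rate` is also an assumption, added so that the loop terminates even if the threshold cannot be met.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
Design = Callable[[List[Frame], int], Tuple[List[Frame], float]]


def fixed_distortion_codebook(training: List[Frame], threshold: float,
                              design: Design, max_rate: int = 6):
    """Double the code book size until the training distortion is <= threshold.

    Returns (rate, codebook, average_distortion). If the threshold is never
    reached, the largest code book tried (size 2**max_rate) is returned.
    """
    for rate in range(max_rate + 1):
        size = 2 ** rate  # sizes are restricted to powers of 2
        codebook, avg = design(training, size)
        if avg <= threshold:
            break
    return rate, codebook, avg
```

Because each section has its own training sequence, running this loop per section naturally yields section code books of different sizes within one code book set.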

The third type of code book is the unclustered code book. These are
generated without the clustering algorithm, simply by making a codeword out of
each frame in the training sequence. Our motivation for considering unclustered
code books was twofold. The first was computational efficiency and convenience:
generating them is much easier than generating clustered code books. The
second was as a measure of performance. Since the clustering procedure
attempts to find spectrum shapes that are representative of the training
sequence, the effectiveness of clustering can be evaluated by comparing the
performance of clustered and unclustered code books designed from the same
training sequence.

D.  Distortion  Measures 

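In generating code books for voice coding, two distortion measures are
effective [2, 20]. They are the Itakura-Saito ($d_{IS}$) and gain-normalized
Itakura-Saito ($d_{GN}$) distortion measures. For two power spectra $f(\theta)$
and $\hat{f}(\theta)$, the $d_{IS}$ distortion between them is

$$d_{IS}(f, \hat{f}) = \int_{-\pi}^{\pi} \left[ \frac{f(\theta)}{\hat{f}(\theta)} - \ln \frac{f(\theta)}{\hat{f}(\theta)} - 1 \right] \frac{d\theta}{2\pi}. \tag{6}$$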

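For power spectrum estimates $f$ and $\hat{f}$ that have the autoregressive
(LPC) form

$$f(\theta) = \frac{\sigma^2}{|A(z)|^2}, \tag{7}$$

where

$$A(z) = \sum_{k=0}^{M} a_k z^{-k}$$

and $z = \exp(i\theta)$, the $d_{GN}$ distortion is given by

$$d_{GN}(f, \hat{f}) = d_{IS}\!\left(\frac{f}{\sigma^2}, \frac{\hat{f}}{\hat{\sigma}^2}\right) = \frac{\alpha}{\sigma^2} - 1, \tag{8}$$

where

$$\alpha = r(0)\, r_a(0) + 2 \sum_{n=1}^{M} r(n)\, r_a(n),$$

$$r_a(n) = \sum_{i=0}^{M-n} \hat{a}_i \hat{a}_{i+n},$$

and where $r(n)$ are the time-domain autocorrelations of $f(\theta)$.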

Equations (6) and (7) show that $d_{IS}$ depends on both the spectrum shape
and the gain ($\sigma^2$). Thus, using it in (2) to design code books results in
clusters that are sensitive both to spectrum shape and gain. Using $d_{GN}$,
however, leads to clusters that depend only on spectrum shape. After extensive
speech recognition experiments comparing the performance of these two
distortion measures using single-section code books [16], we concluded that
$d_{GN}$ code books are better for speech recognition than $d_{IS}$ code books,
particularly when using small code books built from short training sequences.
Thus, we used $d_{GN}$ code books in the work reported herein.

For the classification distortion measure in (4), we considered three
choices: $d_{IS}$, $d_{GN}$, and the gain-optimized Itakura-Saito ($d_{GO}$)
distortion measure,

$$d_{GO}(f, \hat{f}) = \min_{\lambda > 0} d_{IS}(f, \lambda \hat{f}). \tag{9}$$

Like $d_{GN}$, $d_{GO}$ is sensitive to spectral shape only. Properties of all
three distortion measures are discussed in [21]. In our work with single-section
code books [16], we found $d_{GO}$ to be the best choice, and we used that same
choice in the work reported herein. For LPC spectra of the form (7), $d_{GO}$
can be expressed as

$$d_{GO}(f, \hat{f}) = \ln(\alpha) - \ln(\sigma^2). \tag{10}$$
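
For LPC spectra, both shape-only measures reduce to a few multiply-adds per codeword. The Python sketch below evaluates $\alpha$, $d_{GN}$, and $d_{GO}$ from the input-frame autocorrelations, the input gain, and a codeword's predictor coefficients. The function names are hypothetical, and the sketch assumes the standard closed forms $d_{GN} = \alpha/\sigma^2 - 1$ and $d_{GO} = \ln\alpha - \ln\sigma^2$.

```python
import math
from typing import Sequence


def alpha(r: Sequence[float], a_hat: Sequence[float]) -> float:
    """alpha = r(0) r_a(0) + 2 * sum_{n=1..M} r(n) r_a(n).

    r      -- input-frame autocorrelations r(0..M)
    a_hat  -- codeword LPC coefficients a_0..a_M with a_0 = 1
    """
    M = len(a_hat) - 1
    # Autocorrelation of the codeword's coefficient sequence.
    r_a = [sum(a_hat[i] * a_hat[i + n] for i in range(M - n + 1))
           for n in range(M + 1)]
    return r[0] * r_a[0] + 2.0 * sum(r[n] * r_a[n] for n in range(1, M + 1))


def d_gn(r: Sequence[float], sigma2: float, a_hat: Sequence[float]) -> float:
    """Gain-normalized Itakura-Saito distortion; sigma2 is the input gain squared."""
    return alpha(r, a_hat) / sigma2 - 1.0


def d_go(r: Sequence[float], sigma2: float, a_hat: Sequence[float]) -> float:
    """Gain-optimized Itakura-Saito distortion."""
    return math.log(alpha(r, a_hat)) - math.log(sigma2)
```

When the codeword coefficients equal the input frame's own LPC coefficients, $\alpha$ equals the input's prediction-error energy $\sigma^2$ and both distortions are zero; because $\ln x \le x - 1$, $d_{GO}$ never exceeds $d_{GN}$.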


E.  LPC  Parameters 

LPC parameters for both code book generation and utterance classification
were generated using the autocorrelation method with Hamming windowing.
Except for N, the number of points to shift between successive speech frames,
we chose analysis conditions for compatibility with the Navy's 2.4-kb/s LPC-10
system [22]: analysis window width = 130 points, filter order = 10, and pre-
emphasis^^.  When  using  the  length-normalized  approach,  N  was  adjusted  to 
satisfy the normalization length requirement; however, when using the
left-aligned approach, N = 180 was used, as is done for the Navy's LPC-10
system. The LPC analysis parameters used in classifications were always chosen
to match those used in generating the code books.
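
For readers unfamiliar with the autocorrelation method, the following Python sketch shows one conventional way to obtain LPC coefficients and the residual energy from a single frame: Hamming windowing, autocorrelation, then the Levinson-Durbin recursion. It is a generic textbook formulation rather than the LPC-10 code; pre-emphasis is omitted, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def autocorrelation(frame: Sequence[float], order: int) -> List[float]:
    """Hamming-window the frame and return autocorrelation lags 0..order."""
    W = len(frame)
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (W - 1)) for i in range(W)]
    x = [s * wi for s, wi in zip(frame, w)]
    return [sum(x[i] * x[i + n] for i in range(W - n)) for n in range(order + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> Tuple[List[float], float]:
    """Solve the autocorrelation normal equations.

    Returns (a, err) with a[0] = 1 and err the residual (prediction error)
    energy, using the convention A(z) = sum_k a[k] z^-k.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err  # reflection coefficient for stage m
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= (1.0 - k * k)
    return a, err
```

With the conditions quoted above, a call such as `levinson_durbin(autocorrelation(frame, 10), 10)` on a 130-point frame would yield the ten predictor coefficients and the gain term used by the distortion measures. A frame with zero energy would cause a division by zero, so silent frames need to be excluded beforehand.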

F. Figures of Merit

The  error  rates  reported  in  this  paper  are  substitution  error  rates.  We 
forced  a  choice  for  each  utterance  presented  to  the  recognizer  algorithm,  and 
we  presented  only  utterances  that  contained  legitimate  vocabulary  words. 

We used two figures of merit in evaluating the experiments. The first is
simply the recognition accuracy. The second attempts to quantify the extent to
which the classifications are correct or incorrect. In particular, suppose that
the input utterance is the $m$th word in the recognition vocabulary. For correct
classification, $D_m$ should be the smallest of the average distortions (3),
i.e., $D_r = D_m$ (see (5)). Define

$$D^* = \min_{k \ne m} D_k \tag{11}$$

as the smallest average distortion of all code books except the correct one, and
define

(12)

If the classification is correct, $R > 0$; if the classification is incorrect,
$R < 0$. For correct classifications, $R$ is the fractional difference between
the distortion of the correct code book and the distortion of the next best
choice; a large value of $R$ means that the correct code book stands out clearly
from the other choices. For each experiment, we computed the number of errors,
the average value of $R$ ($R_\mu$), and the standard deviation of $R$
($R_\sigma$).
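
A small Python sketch of this figure of merit follows. It reproduces the stated sign behaviour ($R > 0$ for a correct classification, $R < 0$ for an error), but normalising the difference by $D_m$ is an assumption made for illustration; the function name `figure_of_merit` is hypothetical.

```python
from typing import Sequence


def figure_of_merit(distortions: Sequence[float], m: int) -> float:
    """R for an utterance whose true word index is m.

    D* is the smallest distortion among the wrong code books. Normalising by
    D_m is an assumption made for this sketch.
    """
    d_m = distortions[m]
    d_star = min(d for k, d in enumerate(distortions) if k != m)
    return (d_star - d_m) / d_m
```

For example, with average distortions 1.0, 2.0, and 4.0, the value is positive when the true word is the first one and negative when it is the second.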


III.  EXPERIMENTAL  BACKGROUND 

Our experiments were conducted using a data base that was prepared by
Texas Instruments, Inc. (TI) during a systematic test of discrete-utterance
recognition devices [23]. A data base should be used solely for either tuning or
testing a recognition algorithm. To balance the conflict between tuning and
unbiased testing, we chose the following procedure. We first tuned the algorithm
based on prior experience and on a speaker-independent, male-only parameter
study. We then tested the tuned algorithm on the female speakers in the data
base. In addition, we tested the tuned algorithm in a speaker-dependent mode
on the entire TI data base.

Automatic endpoint detection for both training-sequence and classification
utterances was used in our experiments. Our endpoint-detection algorithm is
based on ideas presented in [24, 25], and is described in [16]. Briefly, the
algorithm first analyzes the background noise to determine its average magnitude
and then uses the results to set various thresholds that are used to find
significant "energy clumps" in the data.

In  the  rest  of  this  section  we  describe  the  data  base,  the  experimental 
parameters,  and  the  experiments. 

A. TI Data Base

The TI data base [23] consists of twenty words: the digits zero through nine
and the ten control words yes, no, erase, rubout, repeat, go, enter, help, stop,
and start. Eight male and eight female speakers each recorded twenty-six
repetitions of each word in the vocabulary, for a total of 8320 utterances. The
data was recorded on analog tape under tightly controlled conditions: the noise
level was low, the speech level was restricted to a ±3 dB range, the acoustic
environment was unvarying, and all errors in the input words were eliminated.
After collection, the data was low-pass filtered and sampled at 12,500 samples
per second. We received the data in digital form on magnetic tape. Each
utterance, preceded and followed by short segments of ambient noise, was
contained in a separate file. In a previous study using single-section code
books [16], we used the data primarily at the 12,500 sampling rate. For the work
reported here, the data was down sampled to 8000 samples per second. The down
sampling procedure is described in [16].

B.  Experimental  Parameters 

In  this  subsection,  we  describe  the  experimental  parameters  associated 
with  code  book  generation  and  utterance  classification.  The  code  book  genera¬ 
tion  parameters  are  as  follows: 

(a) number of utterances in the training sequence;

(b) energy threshold $E_{\min}$, where $E$ is computed by

$$E = \frac{1}{W} \sum_{i=1}^{W} x_i^2.$$

Here, $W$ is the analysis window width, and $x_i$ are the time-domain
samples from a 12-bit A/D converter after pre-emphasis and Hamming
windowing;

(c)  left-alignment  or  length-normalized  alignment; 

(d)  compression  factor; 

(e)  code  book  type  and  size. 

The energy threshold is used to ignore nearly-silent frames; frames with
energy below this threshold are not used in designing code books or performing
a classification. For all the work reported here, we used $E_{\min} = 250$.
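
The thresholding step can be sketched in Python as below. The mean-square form of the frame energy is an assumption made for this sketch, and the function names are hypothetical; the frames passed in are taken to be already pre-emphasised and windowed.

```python
from typing import List, Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Mean-square energy of one windowed frame (assumed form of E)."""
    return sum(x * x for x in samples) / len(samples)


def keep_frames(frames: List[Sequence[float]], e_min: float) -> List[Sequence[float]]:
    """Drop frames whose energy falls below the threshold e_min."""
    return [f for f in frames if frame_energy(f) >= e_min]
```

The same filter would be applied both when building training sequences and when classifying, so that the two stages see comparable frames.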

The  parameters  associated  with  utterance  classification  are  as  follows: 

(a)  compression  factor; 

(b) utterance alignment;

(c)  energy  threshold. 

For  consistency  these  values  were  chosen  to  match  those  used  in  the  code  book 
generation. 

C. List of Experiments

In this subsection, we list the experiments reported in the remainder of the
paper. The following speaker-independent experiments are listed according to
the corresponding subsection of Section IV:

A. Complete male-data-base study of recognition accuracy as a function of
compression factor and section code book rate;

Comparison of recognition performance using unclustered and
clustered code books when using the "best" compression factor;

Study of recognition accuracy as a function of the normalization length;

Recognition accuracy comparison using fixed-size and fixed-distortion
code book sets;

Recognition accuracy comparison of left-aligned and length-normalized
approaches;

B. A female-only experiment using parameters that did best during the
male parameter study;

C. Classification of 4 speakers using code books designed from both male
and female speakers.

Section V contains the results of speaker-dependent experiments. The
experiments are listed according to the corresponding subsection of Section V:

A. Comparison of multi-section and single-section recognition performance
on the sixteen speaker data base;

B. A rate-0 multi-section study;

C. Recognition results for fixed-size code books with short training
sequences;

Recognition results for unclustered code books with short training
sequences.

IV.  SPEAKER-INDEPENDENT  EXPERIMENTS 

In this section, we describe three sets of experiments. The first set
consisted of parameter studies done on just the male speakers: we varied the
compression factor, section code book rate, utterance alignment, and code book
design method. Based on the results, we give guidelines for parameter selection.
In the second set of experiments, the parameters were fixed based on the results
of the first set, and speaker-independent classification experiments were done
for the female speakers. In the last set, a combined male and female recognition
experiment was done.

A. Male Parameter Study

For all parameter studies, the LPC parameters are those specified in
Section II.E. We considered each of the 8 male speakers in turn. For each male
speaker, we classified 520 utterances using code books designed from the first 9
utterances from each of the other 7 males. We used multiple repetitions by
speakers in the training sets because of the small number of speakers we had
available, not because we believe it to be an efficient way to train a recognizer.

In the first parameter study we examined the relationships among
compression factor, section code book rate, and recognition accuracy. We used
a 24-frame, length-normalized approach; 24 frames was approximately the
average length of the words in the recognition vocabulary. We used fixed-size
section code books with rates 2, 3, and 4 together with compression factors
1, 2, 3, 4, 6, 8, and 12. The results are plotted in Figure 1. Note that each
point on the plot represents 4160 speaker-independent classifications: 520
classifications per speaker for 8 speakers.

Based  on  Figure  1,  we  make  the  following  observations: 

(a) at each compression factor, the error spread is less than 2% for all
section code book rates;

(b)  the  difference  in  error  rates  between  section  code  book  rates  2  and  3  is 
generally  small,  but  it  is  consistent  and  significant; 

(c)  there  is  no  significant  difference  in  error  rates  for  section  code  book 
rates  3  and  4; 

(d)  a  compression  factor  between  3  and  6  appears  best. 

To gain insight into any relationship among word complexity (such as the
number of syllables or phonemes), compression factor, and error rate, we
examined the number of errors as a function of compression factor for the
nondigit words. We had conjectured that simpler words like no, go, and yes would
be easier to recognize using larger compression factors, and that more complex
words like repeat, rubout, and start would require smaller compression factors.
The data, however, showed no obvious correlation between word complexity,
error rate, and compression factor.

Figure 1. Relationship among compression factor, error rate, and section code
book rate for speaker independent recognition.

Previously [16], we performed a similar speaker-independent classification
experiment on these same 8 male speakers. There we used the single-section
approach and the original 12,500 samples per second data. The training method
was the same as used here: nine utterances from each of the seven speakers not
being classified were used to build code books. The analysis conditions
consisted of the following: N = 250 (20 milliseconds), analysis window = 250
points, filter order = 16, pre-emphasis = 90%, and Hamming windowing. As in this
study, the autocorrelation method of LPC was used. Using rate-5, single-section
code books, an average recognition accuracy of 86% was achieved, as opposed to
the 97.5% achieved by the current approach. Thus by using the multi-section
approach, the number of distortion computations per classification has been
reduced (by a factor of 4 for rate-3 section code books), and the number of
errors has been reduced by about a factor of 4.
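
The factor-of-4 saving in distortion computations follows from the code book sizes alone, as the short Python calculation below illustrates. It assumes, purely for illustration, the same number of frames per utterance (24) in both cases and a 20-word vocabulary; the function name is hypothetical.

```python
def distortion_computations(frames: int, rate: int, vocabulary: int) -> int:
    """Each frame is compared with every codeword of one (section) code book."""
    return frames * (2 ** rate) * vocabulary


single = distortion_computations(24, 5, 20)  # rate-5, single-section
multi = distortion_computations(24, 3, 20)   # rate-3, multi-section
print(single, multi, single // multi)        # prints "15360 3840 4"
```

In the multi-section case each frame is compared only against the 8 codewords of its own section code book, rather than against all 32 codewords of a rate-5 single-section code book.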

As stated earlier, unclustered code books are generated by making a
codeword out of each frame in the training sequence, and the effectiveness of
clustering can be evaluated by comparing the performance of unclustered and
clustered code books designed from the same training sequence. We built
unclustered code books using a compression factor of 4 and the same LPC
analysis parameters as specified for the clustered code books. The result is
marked by NC in Figure 1. The degradation in recognition performance using
rate-3 clustered code books instead of unclustered code books is small, about
0.5%. Since the rate-3, multi-section code books are only about 1/30 the size of
the unclustered code books and the error rates for the two are close, it is
apparent that the clustering procedure performs an effective data compression
function.

Next we studied the effect of normalization length on recognition accuracy.
We felt that, in general, longer normalization lengths would result in higher
recognition accuracies. Doubling the normalization length, however, also
doubles the number of distortion computations needed to compare an input
utterance with a code book. We were searching for the shortest normalization
length that did not significantly degrade the recognition accuracy. To study
this, we chose normalization lengths of 12, 24, and 36. We used rate-3 section
code books, and the compression factor was adjusted in each case so that there
were 6 section code books per word. Note that for a fixed analysis window width,
increasing normalization length increases the overlap between adjacent analysis
frames.
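
The relation between normalization length and frame overlap can be illustrated
with a small sketch. It assumes that length normalization is obtained by
choosing the frame shift so that exactly L analysis windows span the utterance;
the function names, the 4000-sample utterance, and the 128-sample window are
hypothetical values chosen only for illustration.

```python
def frame_shift(num_samples, window, num_frames):
    """Shift between successive frames so that num_frames windows span the utterance."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return (num_samples - window) / (num_frames - 1)


def frame_overlap(num_samples, window, num_frames):
    """Samples shared by adjacent frames (a negative value means a gap)."""
    return window - frame_shift(num_samples, window, num_frames)


# Hypothetical utterance and window sizes, in samples.
for length in (12, 24, 36):
    print(length, round(frame_overlap(4000, 128, length), 1))
```

With the window width held fixed, the printed overlap grows as the
normalization length grows, which is the effect noted above.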

The results, listed in Table I, show that the average recognition accuracy
increases gradually with increases in the normalization length. The question
remains, however, whether the increase is significant.

To test for statistical significance, we used the two-sample Wilcoxon rank
sum test [26]. For this test, let $F(x)$ be the probability distribution
function describing the recognition accuracy $x$ of a multi-section approach
with a specific set of multi-section parameters (compression factor, section
code book rate, normalization length, etc.). In the normalization length study
described above, let $F_s(x)$ be the probability distribution function
describing the recognition performance of one of the shorter length-normalized
approaches, and let $F_l(x)$ be the probability distribution function for an
approach with a longer normalization length. Also, let $\mu_s$ be the mean
recognition accuracy corresponding to $F_s(x)$, and let $F_l(x)$ have a mean
$\mu_l$. The null hypothesis for our test is $F_s(x) = F_l(x)$ for all $x$;
thus, $\mu_s = \mu_l$. The alternative hypothesis is $F_s(x) = F_l(x + \Delta)$
for some positive $\Delta$, or $F_s(x)$ is shifted to the left of $F_l(x)$.
This implies $\mu_s < \mu_l$.

We performed the Wilcoxon test for all three length combinations: 12 vs. 24,
12 vs. 36, and 24 vs. 36. The significance levels for rejection of the null
hypothesis of equal mean recognition accuracies were .186, .104, and .397
respectively. Based on the Wilcoxon test results and the average recognition
accuracies, we believe the increase in computations in going from 12 frames to
24 frames is justified, but the increase in going to 36 frames is not justified.
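
A one-sided rank sum test of this kind can be sketched as follows. This is a
minimal illustration, not the procedure of [26]: it assumes an exact test
computed by enumerating all ways of assigning the pooled ranks to the two
samples, with tied values given average ranks. The function names and the toy
data are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_p_value(shorter, longer):
    """Exact one-sided level for the alternative that `shorter` is shifted left."""
    pooled = list(shorter) + list(longer)
    ranks = midranks(pooled)
    m = len(shorter)
    observed = sum(ranks[:m])
    at_most = total = 0
    for subset in combinations(range(len(pooled)), m):
        total += 1
        if sum(ranks[i] for i in subset) <= observed + 1e-9:
            at_most += 1
    return at_most / total


# Toy accuracies (hypothetical): every "shorter" value lies below every "longer" one.
print(rank_sum_p_value([1, 2, 3], [4, 5, 6]))
```

A small significance level favors the alternative that the first sample is
shifted to the left of the second. Full enumeration is practical only for small
samples, such as one accuracy per speaker.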

Previously [16], we compared the performance of fixed-distortion and
fixed-size code books using the single-section approach. Although in that study
the fixed-size code books performed better than the fixed-distortion code books,
we felt this might not hold true when using multi-section code books. One
reason is that each section code book represents only a small portion of a word
instead of the whole word as in the single-section approach. This restriction
might reduce the types of confusions that earlier caused fixed-distortion code
books to perform worse than fixed-size code books. The possible advantage of
fixed-distortion code books is that each one is only as large as necessary to
satisfy the distortion criterion. Thus fixed-distortion code books might lead
to the same classification performance as fixed-size code books but with fewer
total codewords. This could lead to smaller memory requirements and faster
classification.

We chose T = .45 and T = .30 as distortion thresholds, and we designed
fixed-distortion code book sets using the same conditions as used in the
previous fixed-size code book studies. For the T = .45 threshold, the average
section code book size was 7.35 codewords; for the T = .30 threshold, it was
15.99 codewords.

The average recognition accuracy using the fixed-distortion code books with
T = .45 was 96.5%. With T = .30, the recognition accuracy was 96.3%. The
fixed-size, rate-3 and -4 code book sets had recognition accuracies of 97.2%
and 97.5% respectively. So, as before [16], the fixed-size code books
discriminate better in word recognition than do fixed-distortion code books.


So far, the experiments used length-normalized code books. We tested the
left-aligned approach using a compression factor of 4, a section code book rate
of 3, and, except for N (the number of points to shift between successive speech
frames), the same analysis conditions as before. In the left-aligned experiment,
N was fixed at 130. Left alignment was used both to design code books and to
classify input utterances.

The left-aligned results together with the rate-3, compression factor 4,
length-normalized results are shown in Table II. The length-normalized
approach is clearly superior. This conclusion is also supported by the Wilcoxon
test: the significance level is .012 for rejecting the null hypothesis of equal
mean recognition accuracies.

The  foregoing  results  suggest  the  following  guidelines: 

(a) length normalization should be used with analysis conditions that
provide frame overlap;

(b) the compression factor should correspond to roughly 20% of the
normalized length;

(c)  fixed-size  section  code  books  of  at  least  rate-3  should  be  used. 

Although the speakers in these studies represented several of the major
dialects, the speaker sample was small and homogeneous - 8 male speakers living
in Texas. Thus, the rate-3 section code books might be too small. In the next
two sections we further evaluate this issue by studying a female speaker sample
and a combined male and female speaker sample.

B. Female Results


Using a compression factor of 4 and 24-frame length normalization, we
studied speaker-independent recognition using the 8 female speakers. As in the
male study, we classified 520 utterances for each speaker using code books
designed from the first 9 utterances of each speaker not being classified. The
rate-3 and -4 results are listed in Table III. The rate-4 code books performed
better than the rate-3 code books, but the difference does not appear to be
significant: the Wilcoxon test yields a large significance level of .316 for
rejecting the null hypothesis of equal average recognition accuracies.

The average recognition accuracy of 93.3% for females is significantly less
than the 97.2% found for males. About half of the female errors, however, were
for two speakers: SAS and DFG. On examining the data we found that most of
the errors for DFG occurred for words on which the endpoint detector had
grossly misidentified the endpoints: her voice had a breathy, nasal quality that
was unlike the other speakers. This was not the case for SAS, however. There
seemed to be nothing obviously unusual about her speech, yet it was difficult to
recognize.

To see if the addition of new speakers to the training sequence would
improve the recognition performance, we recorded data from 10 additional
female speakers. The speakers were chosen arbitrarily. Each new speaker
provided 1 utterance of each vocabulary word. The new data was downsampled to
6000 samples per second using the same procedure as used on the TI data, and
it was added to the previous training data. No analysis or experimental
conditions were changed. The results using the expanded training sequences are
shown in Table IV.


The average recognition accuracy increased to 95.1% (98.5% for just the
digits), but more interesting, the improvement was restricted to the two hardest
speakers: SAS and DFG. Thus, adding more training data improved the
recognition performance for the speakers that were poorly represented by the
original training sequence and neither degraded nor improved the results for
the rest of the speakers. Table V contains the confusion matrix for the female
experiments using the expanded 17-speaker training sequence. Each row contains
the results for classifying all utterances of one word in the recognition
vocabulary; the columns correspond to the different classification decisions.
The most frequent errors were no ↔ go and stop → five. The no and go errors
were generally caused by their spectral and temporal similarities. The rest of
the errors are not so easily categorized, but they usually could be attributed
to inadequacies in the training data or to inaccurate endpoint detection.

C. Combined Male and Female Results

The separate results for males and females suggest that a rate-3,
multi-section code book is adequate for recognition purposes. This may not be
the case for mixed populations, however. Because general differences in male
and female vocal tract sizes lead to characteristic formant shifts for the same
speech sounds, larger code book sizes might be required to maintain
performance for mixed populations. We examined this issue by performing a
recognition experiment on a 4-speaker subset of the TI data base (2 males: RLD
and GRD, and 2 females: SAS and ALK). We used code books designed from the
remaining 12 speakers - each speaker provided 9 utterances of each word as
training data.

The results for section code book rates 1 through 5 are shown in Table VI.
For this small speaker sample, no significant improvement resulted from a
section code book rate greater than 3. Table VII contains the individual rate-3
results for the combined male-female training data experiment and the earlier
rate-3 single-sex experiments. A large increase in recognition accuracy for SAS
offset small decreases in recognition accuracy for the rest of the speakers, and
the average recognition accuracy using the combined-sex training sequences
was about the same as that using the single-sex training sequences. The spread
in recognition accuracies, however, using the combined-sex training sequences
has been dramatically reduced. The reduced spread in recognition accuracies
suggests the 12-speaker training sequences characterize the general population
better than the 7-speaker training sequences used earlier, and it gives evidence
that increased stability of performance would result from using richer training
sequences.

V.  SPEAKER-DEPENDENT  EXPERIMENTS 

In this section, we describe the results of speaker-dependent experiments.
In the first experiment, the multi-section approach was tested on the full TI
data base. In the second, two multi-section rate-0 approaches were compared,
and in the final experiment, the effect of short training sequences was
examined. All the experiments described in this section used the 24-frame,
length-normalized approach.

A.  Multi-Section  Results 

In the speaker-independent study described in the last section, good
recognition performance required a section code book rate of at least 3. It
seems reasonable, however, that a smaller section code book rate might suffice
for speaker-dependent recognition. To evaluate this possibility, we performed
speaker-dependent recognition experiments using the 16 speakers in the TI data
base. For each speaker, the first 10 utterances of each word were used as a
training sequence. We used a compression factor of 4 and section code book
rates 0, 1, and 2.

Table VIII contains the results for all 16 speakers. The first 8 are male and
the last 8 are female, and the male results are slightly better than the female
results. As one would expect, the average recognition accuracy improves with
increases in section code book rate. Using the two-sample Wilcoxon test to
compare the rate-0 vs. rate-1, rate-1 vs. rate-2, and rate-0 vs. rate-2
results, the significance levels for rejection of the null hypotheses of equal
average recognition accuracies were .138, .133, and .031 respectively. Based on
the Wilcoxon test results and the average recognition accuracies, we believe
the use of rate-2 section code books significantly increases the recognition
accuracy compared to rates 0 and 1.

The average recognition accuracy obtained with the rate-2 section code
books was 98.7%. A confusion matrix for these results is shown in Table IX. The
most frequent errors were go ↔ no, stop → five, and start → five. Most of the
go and no classification errors were due to their spectral and temporal
similarities. Many of the other classification errors can be attributed to time
alignment problems caused by inadequacies of the endpoint detector.

To be more specific, we examined the errors made using the rate-2 section
code books: there were 66 words incorrectly classified. The endpoints had been
misidentified on 42 of those 66 words. We hand labeled the endpoints on those
42 words and reclassified them with the original code books. Thirty-eight of
the 42 words were now correctly identified, and the average recognition
accuracy increased to 99.5%. This improvement points out the importance of
accurate endpoint detection.
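
The accuracy figures quoted above can be checked with a few lines of
arithmetic. The sketch assumes that the experiment classified 320 utterances
for each of the 16 speakers, the test-set size mentioned for the
speaker-dependent experiments; the function name is hypothetical.

```python
def accuracy_percent(num_errors, num_trials):
    """Recognition accuracy in percent for a given number of errors."""
    return 100.0 * (1.0 - num_errors / num_trials)


TRIALS = 16 * 320  # assumed: 16 speakers, 320 classified utterances each

before = accuracy_percent(66, TRIALS)       # automatic endpoints
after = accuracy_percent(66 - 38, TRIALS)   # 38 words recovered by hand labeling
print(round(before, 1), round(after, 1))
```

Under that assumption, 66 errors correspond to 98.7% and the remaining 28
errors to 99.5%, in agreement with the reported values.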

In our previous single-section work [16], we performed a similar
speaker-dependent classification experiment on the TI data base. In that work,
the 12500 samples per second data was used together with the following analysis
conditions: N = 250 points, analysis window = 250 points, analysis filter order
= 16, pre-emphasis = 90%, and Hamming windowing. As in this study, we used the
autocorrelation method of LPC and the first 10 utterances of each word for each
speaker as training data. The recognition accuracy using single-section, rate-3
code books on the full bandwidth data was about the same as using
multi-section, rate-2 code books on the narrow bandwidth data: 98.8% and 98.7%
respectively. Based on reductions in both the analysis filter order and the
section code book rate, incorporating time-sequence information reduced the
computational requirements by slightly more than a factor of 3, at the expense
of doubling the memory required.
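
The factor of 3 and the doubling of memory can be reproduced approximately
under a simple cost model. The sketch below is an assumption, not a statement
of how the costs were actually counted: it takes one distortion computation
and one stored codeword to cost M + 1 terms each for an LPC filter of order M,
and it uses 6 sections per word for the multi-section code book. All names are
hypothetical.

```python
def per_frame_cost(codewords_searched, order):
    """Terms evaluated to encode one input frame (assumes order + 1 per codeword)."""
    return codewords_searched * (order + 1)


def code_book_storage(codewords_per_section, sections, order):
    """Real numbers stored for one word (assumes order + 1 per codeword)."""
    return codewords_per_section * sections * (order + 1)


# Single-section, rate-3 (8 codewords), filter order 16.
single_cost = per_frame_cost(8, 16)
single_storage = code_book_storage(8, 1, 16)

# Multi-section, rate-2 (4 codewords per section), 6 sections, filter order 10.
multi_cost = per_frame_cost(4, 10)
multi_storage = code_book_storage(4, 6, 10)

print(single_cost / multi_cost, multi_storage / single_storage)
```

Under these assumptions the computation ratio is a little above 3 and the
storage ratio is a little below 2.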

B.  Rate-0  Multi-Section  Study 

The most remarkable aspect of the above speaker-dependent results is the
high recognition accuracy of the rate-0 code books. The multi-section code
book for each word consists of only 6 codewords (one codeword per section),
and the classification of an input utterance requires only one distortion
computation per input frame for each vocabulary word. Moreover, the code book
generation consists simply of computing autocorrelations and averaging them,
which is also easy to do quickly. Yet, despite these major simplifications, a
recognition accuracy of 97.8% was achieved. Considering only the digits, the
recognition accuracy was 99.5%.
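
The rate-0 design step described above can be sketched as follows. The sketch
assumes that every training utterance has already been length normalized to
the same number of frames and that each frame is represented by a list of
autocorrelation values; whether the autocorrelations are scaled before
averaging is not specified here, so none is applied. The function name and the
toy data are hypothetical.

```python
def design_rate0_code_book(utterances, compression_factor):
    """Average the frames of each section over all training utterances.

    utterances: list of utterances, each a list of equal-length frame vectors.
    Returns one averaged vector (a single codeword) per section.
    """
    num_frames = len(utterances[0])
    dim = len(utterances[0][0])
    num_sections = -(-num_frames // compression_factor)  # ceiling division
    sums = [[0.0] * dim for _ in range(num_sections)]
    counts = [0] * num_sections
    for utterance in utterances:
        if len(utterance) != num_frames:
            raise ValueError("utterances must be length normalized first")
        for t, frame in enumerate(utterance):
            section = t // compression_factor
            counts[section] += 1
            for k, value in enumerate(frame):
                sums[section][k] += value
    return [[v / counts[s] for v in sums[s]] for s in range(num_sections)]


# Toy data: 2 utterances, 4 frames each, 2 values per frame, compression factor 2.
toy = [
    [[1, 0], [3, 0], [5, 2], [7, 2]],
    [[1, 0], [3, 0], [5, 4], [7, 4]],
]
print(design_rate0_code_book(toy, 2))
```

With 24 frames and a compression factor of 4, the same procedure yields the 6
codewords per word mentioned above.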

Building references by linearly normalizing the training utterances to the
same length, and then computing the average of a set of parameters for each
frame in the normalized word, is an approach that many researchers evaluated
before the introduction of dynamic programming and whole-utterance clustering
techniques. Our rate-0, compression factor 4 (R0C4) approach is a
modification of that normalize-the-utterance and average-each-frame (NUAF)
approach using autocorrelations as the parameters. Because of the similarity
between the two approaches, it is reasonable to ask if our R0C4 approach is any
better than the old NUAF approach.

In the terminology of this paper, the NUAF approach corresponds to using
rate-0, compression factor 1 (R0C1) code books. So, we designed R0C1 code
books and evaluated them. Based on the speaker-independent parameter study
results, we expected the larger compression factor code books (R0C4) to
perform better than the smaller compression factor code books (R0C1).

Table X contains the R0C1 results along with the previous R0C4 results from
Table VIII. Each compression factor 4 result is better than or equal to the
compression factor 1 result except for speaker WMF, and using the Wilcoxon test
on the two samples, the significance level for rejection of the null hypothesis
of equal average recognition accuracies is .159. We believe the improved
performance using a compression factor of 4 is because of two things: the
slowly varying nature of speech spectra and the freedom from strict time
alignment that a compression factor of 4 allows. Apparently, averaging the
spectra in the training sequence over small sections of a word produces
reference spectra that characterize a speaker's variation in pronunciation
better than averaging over a single frame. Although the significance level for
rejection of the null hypothesis is somewhat large, the amount of storage for
each code book is reduced and the recognition accuracies are better using a
compression factor of 4.


C.  Short  Training  Sequences 

Many speaker-dependent isolated word recognition devices on the market
today use from 1 to 3 training utterances to train the system [27]. Our
previous results [16] suggested the inadequacy of short training sequences, and
we sought to confirm this expectation. Using a compression factor of 4 and the
first 2 utterances of each word as the training sequence, we classified the
same 320 utterances as above for each of the 16 speakers. The average
recognition accuracies were 94.6%, 95.6%, and 95.7% for rate-0, rate-1, and
rate-2 multi-section code books respectively. This is a decrease of about 3% at
each rate relative to the results using 10-utterance training sequences (see
Table VIII).

Finally,  we  performed  a  recognition  experiment  on  4  speakers  using  1- 
utterance  training  sequences.  We  used  unclustered  code  books  to  retain  all  the 
information  in  the  training  data,  and  we  used  a  compression  factor  of  4.  These 
results  along  with  the  2-  and  10-utterance  training  sequence,  rate-2  results  are 
shown  in  Table  XI.  The  effect  of  using  only  one  training  utterance  is  dramatic. 
The  average  recognition  accuracy  for  this  4  speaker  subset  has  fallen  to  90.9%. 

These results using short training sequences simply emphasize what is
commonly known: there is much variability in a speaker's pronunciation of a
particular word.


VI. COMPUTATIONAL AND MEMORY CONSIDERATIONS

It is interesting to compare the computational and memory requirements
of the multi-section VQ approach to those of DTW for the classification of an
unknown input utterance. As we pointed out earlier, the requirements for the
DTW approach can be substantially reduced by incorporating VQ into the DTW
procedure, but we do not consider that case here. Our intention is to compare
the computational and memory requirements of multi-section VQ with those
of "classical" DTW [28]. Savings obtained by tracking the average distortion
during classification to reject several of the hypotheses or using table-storage
and look-up are also not considered.

In this analysis, we consider only the length-normalized approach. Let $M$ be
the LPC analysis filter order, $N_{sc}$ be the number of codewords per section
code book, $n$ be the compression factor, and $L_N$ be the normalization
length. Then the memory required for a multi-section code book is

$$ (M+1)\, N_{sc}\, \mathrm{ceil}[L_N / n] $$

real numbers, where ceil[X] is the smallest integer greater than or equal to X.
Since the input word is normalized to $L_N$ frames, classification requires
$N_{sc} L_N$ distortion computations per multi-section code book.
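
These two counts can be written as small functions. This is a sketch only: it
assumes that each codeword occupies M + 1 real numbers, which is an
interpretation of the memory expression and not a confirmed detail, and the
function and parameter names are hypothetical.

```python
import math


def code_book_memory(order, codewords_per_section, norm_length, compression):
    """Real numbers stored for one multi-section code book (assumes M + 1 per codeword)."""
    sections = math.ceil(norm_length / compression)
    return (order + 1) * codewords_per_section * sections


def distortion_computations(codewords_per_section, norm_length):
    """Distortion computations needed to encode one length-normalized utterance."""
    return codewords_per_section * norm_length


# Example: order 10, rate-3 (8 codewords), 24 frames, compression factor 4.
print(code_book_memory(10, 8, 24, 4), distortion_computations(8, 24))
```

For the example values, the word is divided into 6 sections, so the code book
holds 48 codewords and each classification needs 192 distortion computations
per word.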

In DTW approaches, the reference template and the input utterance are
often linearly normalized to the same length $L$ before doing DTW [28]. High
recognition accuracies can then be achieved with $\alpha L^2$ distortion
computations per reference template, where $\alpha$ is in the range .20 to .35
[28]. Each reference template requires $L$ storage locations, and to achieve
high recognition accuracies, several reference templates per vocabulary word
are normally stored. For speaker-dependent recognition, the number of reference
templates $Q$ is usually one or two; for speaker-independent recognition, $Q$
is normally about ten [29].

It follows that the ratio $D$ of the number of distortion calculations required
by the VQ approach to the number required by the DTW approach is about
$D \approx N_{sc} L_N / (\alpha L^2 Q)$. For fixed-size code books with
$N_{sc} = 2^{R_{sc}}$, where $R_{sc}$ is the section code book rate, and for a
nominal value of $\alpha \approx .25$, the ratio becomes
$D \approx 2^{R_{sc}+2} L_N / (L^2 Q)$.

We shall assume that both normalization lengths are $L = L_N = 32$ frames (640
milliseconds at 20 milliseconds per frame) - this is perhaps too large, but it
is conveniently a power of two. It follows that the ratio of distortion
calculations becomes

$$ D \approx \frac{2^{R_{sc}-3}}{Q}. \qquad (13) $$

For our best speaker-dependent results - 98.7% correct using a section
code book rate $R_{sc} = 2$ - (13) shows the ratio of distortion computations
to be $1/(2Q)$. Since $Q$ is usually 1 or 2 for the speaker-dependent case,
this shows that the multi-section VQ approach requires fewer distortion
computations than DTW. The 98.7% speaker-dependent recognition accuracy of the
multi-section approach is comparable with that achieved by other approaches on
this data base [23]. For speaker-independent recognition, the multi-section
approach required the rate $R_{sc} = 3$. For this case, (13) shows the ratio of
distortion computations to be $1/Q$. Since $Q$ is approximately 10 for the
speaker-independent case, this shows that the multi-section approach requires
an order of magnitude fewer distortion computations than DTW.
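
The ratio of distortion computations can be evaluated numerically for the two
cases discussed. The sketch assumes the general form
$D \approx N_{sc} L_N / (\alpha L^2 Q)$ with $N_{sc} = 2^{R_{sc}}$, a nominal
$\alpha$ of .25, and both lengths equal to 32 frames; the function and
parameter names are hypothetical.

```python
def distortion_ratio(rate, templates, norm_length=32, dtw_length=32, alpha=0.25):
    """Multi-section VQ distortion computations divided by those of DTW."""
    codewords_per_section = 2 ** rate
    vq = codewords_per_section * norm_length
    dtw = alpha * dtw_length ** 2 * templates
    return vq / dtw


print(distortion_ratio(rate=2, templates=1))   # speaker-dependent, one template
print(distortion_ratio(rate=3, templates=10))  # speaker-independent, ten templates
```

The first case gives one half and the second gives one tenth, matching
$1/(2Q)$ with $Q = 1$ and $1/Q$ with $Q = 10$.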

The ratio $W$ of memory locations required by the multi-section approach to
the number required by the DTW approach is

$$ W \approx \frac{N_{sc}}{nQ}, $$

where the length of a DTW reference $L$ has been assumed equal to the
normalization length $L_N$. Using $L_N = 32$ and $n = .2 L_N$, and substituting
$2^{R_{sc}}$ for $N_{sc}$,

$$ W \approx \frac{2^{R_{sc}}}{6.4\, Q}. \qquad (14) $$

Equation (14) shows that for reasonable values of $Q$ and $R_{sc}$,
speaker-dependent recognition using the multi-section approach requires about
one-half the memory that DTW requires, and for speaker-independent recognition,
the multi-section approach requires only 1/8 the memory that DTW requires.
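
The memory ratio can be evaluated in the same way. The sketch assumes that the
ratio takes the form $W \approx N_{sc} / (nQ)$ with $N_{sc} = 2^{R_{sc}}$ and a
compression factor equal to 20% of a 32-frame normalization length; this form
is an assumption, and the function and parameter names are hypothetical.

```python
def memory_ratio(rate, templates, norm_length=32, fraction=0.2):
    """Multi-section VQ memory divided by DTW memory (assumed form N_sc / (n Q))."""
    compression = fraction * norm_length
    return 2 ** rate / (compression * templates)


print(memory_ratio(rate=2, templates=1))   # speaker-dependent, one template
print(memory_ratio(rate=2, templates=2))   # speaker-dependent, two templates
print(memory_ratio(rate=3, templates=10))  # speaker-independent, ten templates
```

The speaker-dependent values bracket one half, and the speaker-independent
value is 1/8.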

During classification, the input speech frames provide the argument $f$ in
(9). It follows that both the time-domain autocorrelations $r(n)$ and the LPC
gain squared $\sigma^2$ must be known for each input frame, which in turn means
that an LPC analysis must be done. For the $d_{GO}$ distortion measure,
however, the gain enters as a constant term ($\ln(\sigma^2)$) that contributes
a constant term in the computation of the average code book distortions (3).
The classification can therefore be done without this term, so no LPC analysis
of the input utterance is required - only autocorrelations need be computed.
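
The reason the gain term can be dropped is that it depends only on the input
frame, so it adds the same amount to the average distortion of every code
book. A minimal sketch with hypothetical per-frame distortions and gain terms
illustrates this; none of the numbers come from the experiments.

```python
def average(values):
    return sum(values) / len(values)


def best_word(distortions):
    """distortions maps each word to its per-frame distortions for one utterance."""
    return min(distortions, key=lambda word: average(distortions[word]))


def add_frame_terms(distortions, frame_terms):
    """Add one input-dependent term per frame to every word's distortions."""
    return {word: [d + c for d, c in zip(values, frame_terms)]
            for word, values in distortions.items()}


# Hypothetical per-frame distortions (gain term omitted) and per-frame gain terms.
without_gain = {"go": [0.30, 0.20, 0.40], "no": [0.25, 0.35, 0.45]}
gain_terms = [1.7, -0.4, 2.2]
with_gain = add_frame_terms(without_gain, gain_terms)

print(best_word(without_gain), best_word(with_gain))
```

Both calls select the same word, since the added terms shift every average by
the same amount.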

The software for these experiments was written in FORTRAN-77 and run on a
DEC VAX 11/750 with a floating point accelerator. Starting with the
autocorrelations from a 63-utterance training sequence, generating the
fixed-size, rate-3, multi-section code books required about 2 minutes of
execution time each. Classification of a single utterance with these code books
took about 0.1 second per code book, about ten times faster than our previous
approach to speaker-independent recognition [16]. The speedup is the result of
a combination of factors: the section code books are smaller than the previous
single-section code books (8 code words instead of 32 code words), the narrower
bandwidth data (4000 Hz vs. 6250 Hz) allowed a reduction in the LPC filter
order from 16th to 10th, and autocorrelations were computed over a 16
millisecond window instead of a 20 millisecond window. Since all the software
was designed for research purposes, specially designed programs should run
considerably faster.

VII. SUMMARY AND DISCUSSION


In comparison to our previous single-section results [16], the incorporation
of time-sequence information into the VQ recognition procedure has improved
recognition performance. For male speaker-independent recognition, the
average recognition accuracy for the 20-word vocabulary increased from 88% to
97% with a factor of 4 reduction in computational complexity. For female
speakers, the average speaker-independent recognition accuracy was 95% on the
20-word vocabulary, and it was 98.5% on just the digits. For speaker-dependent
recognition, the multi- and single-section approaches performed approximately
the same, but the multi-section approach required only half the number of
distortion computations. The costs for the computational and accuracy
improvements of the multi-section approach are a slightly more complicated
control structure and an increase in memory for code book storage.

Perhaps the most remarkable multi-section VQ result was the 97.8% (99.5%
for digits) speaker-dependent recognition accuracy for the rate-0 section code
books. Only six spectra are used to characterize each vocabulary word,
classification requires only one distortion computation per input speech frame
per vocabulary word, and the code book design requires no clustering.

The memory requirements and computational complexity of the
speaker-dependent, multi-section approach are about 1/2 to 1/4 those of the DTW
approach. For speaker-independent recognition the multi-section approach
requires only about 1/8 the memory and 1/10 the distortion computations of
DTW. It follows that the multi-section approach will be particularly useful when
the computational and memory burden of multiple templates cannot be
afforded.


As general conclusions about the multi-section VQ approach, we offer the
following:

(a) all utterances should be length normalized before processing;

(b) the normalization length should be as long as computational
constraints permit (up to the maximum word length expected);

(c)  the  analysis  conditions  should  provide  frame  overlap; 

(d)  for  speaker-independent  recognition,  a  section  code  book  rate  of  at 
least  3  is  required; 

(e)  for  speaker-dependent  recognition,  a  section  code  book  rate  of  at  least 
2  is  required; 

(f)  short  training  sequences  cannot  be  used; 

(g)  accurate  endpoint  detection  is  important. 

The success of the multi-section approach is due primarily to two things.
First, VQ code books are an efficient representation of the training data. Second,
multi-section code books allow flexibility in the time alignment of an input
utterance with a code book, but they enforce sectional time alignment. In fact,
there is an analogy in the time alignment procedures of DTW and multi-section
VQ. Neither enforces a strict sequential frame by frame comparison of the input
and references, and both find locally a best path through the reference. The
analogy quickly breaks down, but it is clear that the nonlinear time alignment
allowed by both approaches contributes to their success.
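
Sectional time alignment can be illustrated with a short classification
sketch. It is not the experimental software: a squared-error distortion stands
in for the spectral distortion measure used in this work, the input is assumed
to be length normalized already, and all names and data are hypothetical.

```python
def squared_error(a, b):
    """Stand-in distortion between two vectors (assumption, not the paper's measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(frames, section_code_books, compression_factor):
    """Encode each frame with the code book of its own section only."""
    total = 0.0
    last = len(section_code_books) - 1
    for t, frame in enumerate(frames):
        section = min(t // compression_factor, last)
        # Free choice of codeword within the section, no choice across sections.
        total += min(squared_error(frame, cw) for cw in section_code_books[section])
    return total / len(frames)


def classify(frames, code_books, compression_factor):
    """Return the word whose multi-section code book gives the smallest average distortion."""
    return min(code_books,
               key=lambda w: average_distortion(frames, code_books[w], compression_factor))


# Two toy words containing the same spectra in opposite order.
code_books = {"ab": [[[0.0]], [[1.0]]], "ba": [[[1.0]], [[0.0]]]}
utterance = [[0.0], [0.0], [1.0], [1.0]]
print(classify(utterance, code_books, compression_factor=2))
```

The two toy words differ only in the order of their sections, so a code book
without sections could not separate them, while the sectioned code books can.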

Our results are encouraging, but they were for a small, homogeneous set of
speakers. How multi-section VQ will perform on a larger, more diverse
population is an open question, which we intend to investigate.


Our original single-section VQ approach tried to model each vocabulary
word as a discrete memoryless source. Although the results were good, this
model is, of course, naive. A better source model for an isolated word is a
Markov model, and many researchers have used this idea [30,31,15]. Multi-section
VQ is an ad hoc way of incorporating memory. It can be viewed as a one-step
Markov model with transition probabilities that are either zero or one for moving
to the next state or section. It would be more satisfying, and we suspect more
accurate, if the states and the state representations for a word were determined
by the same criterion as that used in designing a memoryless VQ code book -
minimizing the distortion between the training data and the representation.
Some steps in this direction have been made.

Ostendorf and Gray have developed an algorithm for designing both a
separate zero memory quantizer for each of a finite number of states and a set
of next-state functions depending only on the current state and codeword to
update the state [32]. Using this algorithm, a separate finite-state vector
quantizer could be designed for each vocabulary word, and an unknown input
utterance could be classified by encoding it in each of the finite-state vector
quantization code books, just as is now done with the multi-section code books.
Since time-sequence information is implicit in the next-state function, and
since a state code book is likely to be smaller than a section code book, the
recognition accuracy should improve and the computational complexity should
decrease.

Table I. Male Speaker-Independent Recognition: Length-Normalization Study

Table VI. Results Using Combined Male and Female Training Data: Compression
Factor = 4

Table IX. Full Data Base Speaker-Dependent Confusion Matrix: Compression
Factor = 4, Section Rate = 2

Table X. Compression Factor Study For Speaker-Dependent Recognition: Section
Rate = 0